(54) DEVICE AND METHOD FOR MAKING INFORMATION UNINDIVIDUALIZED 

(57)Abstract: 

PROBLEM TO BE SOLVED: To relate many identification 
data items to one individual without spoiling the capability of 
identifying the individual by generating two data sets by 
separating identification data from other data, and relating an 
identifier generated by providing only the individual 
identification information for a trust institution to the other 
data and generating unindividualized data. 
SOLUTION: A data provider 112 processes information 
inputted in the form of a database 111, separates 
identification data from data in data provider information 111, 
and sends the identification data 113 to a trust institution 
(TTP) 116. The trust institution 116 sends back individual 
data having records including unique identifiers. The data 
provider 112 matches the unique identifiers to data with the 
inputted data provider information 111 and separates the 
unique identifiers related to other information into an 
unindividualized database 120. The unindividualized database 
is thereafter sent to a data user 118 for analysis. 
[0006] 

[Means for Solving the Problems] This invention concerns a computer-implemented method and an 
apparatus that allow owners or providers of information that incorporates personal identifiers 
(data providers) to distribute the data to data users in a depersonalized form. "In a 
depersonalized form" means "without revealing the identity of the person to whom the data relates." 
The data is otherwise unchanged. Under the method of this invention, the data provider separates 
personal data from the remainder of the data and creates two data sets. Only personal identifying 
information is provided to the trusted third party (TTP). The TTP generates identifiers that replace 
all the data in the database that can be used to identify individuals, such as names, 
addresses and social security numbers. The TTP may also collect and store the personal identifying 
information so that it can later determine whether identifiers generated by the data provider or 
the TTP relate to the same individual. The data provider relates the identifiers supplied by the 
TTP with the other data, and generates depersonalized data. The depersonalized data can be sent to 
data users for analysis. In this way, the data user can match separate records from multiple data 
providers with a single individual, and the data provider can guarantee that it will not distribute 
personal identifying information that can link a specific data record with an individual. 
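
The separation of a record into two data sets can be illustrated with a minimal Python sketch. The field names and the `split_record` helper are assumptions made for illustration only; the text does not prescribe a record layout, and the data provider decides which fields count as identifying.

```python
# Hypothetical set of fields treated as identifying; chosen by the data provider.
IDENTIFYING_FIELDS = {"name", "address", "ssn"}


def split_record(record, identifying_fields=IDENTIFYING_FIELDS):
    """Split one record into (identifying part, remaining part)."""
    identifying = {k: v for k, v in record.items() if k in identifying_fields}
    other = {k: v for k, v in record.items() if k not in identifying_fields}
    return identifying, other


record = {"name": "Robert Smith", "address": "1 Main St",
          "ssn": "123456789", "diagnosis": "asthma"}
identifying, other = split_record(record)
```

In this sketch only `identifying` would be sent to the TTP; `other` stays with the data provider until an identifier comes back.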
[0007] 

[Embodiments of the Invention] To put it briefly, this invention is a method and an apparatus for 
processing confidential information that identifies individuals, allowing anonymous analysis of the 
data. In the embodiments of this invention explained below, the data provider in possession of a 
database that contains confidential information divides the information into two parts, identifying 
information and other information. Using the identifying information, the provider generates, or has 
generated for it, a unique identifier. The unique identifier is linked with the identifying 
information in the data provider's database. After this, the data owner tags the other information 
mentioned above with the unique identifier and supplies the tagged data to the data user. In the 
embodiments set out below, the unique identifier is generated by or registered with the trusted 
third party (TTP). The TTP can match the identifying information received from the data provider 
with other identifying information already in the TTP's database. The TTP is an agency that is 
under a contractual agreement to protect the identifying information from disclosure while 
maintaining and processing the data as necessary. By matching the identifying information, the TTP 
can link multiple identifiers that have been connected to data from several providers. These links 
can be provided directly to data users, so that the data users can correlate data from multiple 
sources. 
[0008] In this invention, the word 'depersonalization' is used to designate the processing step where 
identifying information is deleted from user data records and is replaced with unique identifiers. This 
word, as it is used in the technical field of data processing, includes the terms 'anonymization' and 
'coding'. When data is anonymized or coded, all identifying information is deleted from the record 
and a truly random identifier is allocated to refer to the relevant person. In addition, the word 
'depersonalization' also includes the processing step of replacing personal identifying information in 
the data record with an identifier that is not truly random. This type of identifier may be, for example, 
a hash function value generated from a specified subset of the identifying information, or some other 
value. 
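
A non-random identifier of this kind can be sketched as follows. This is only an illustration under stated assumptions: the choice of SHA-256, the field subset `("name", "ssn")`, the normalisation, and the optional `salt` parameter are not taken from the text, which names no particular hash function.

```python
import hashlib


def hashed_identifier(identifying, fields=("name", "ssn"), salt=b""):
    """Derive a non-random identifier from a fixed subset of identifying fields."""
    # Normalise so that trivial formatting differences do not change the hash.
    material = "|".join(str(identifying.get(f, "")).strip().lower() for f in fields)
    return hashlib.sha256(salt + material.encode("utf-8")).hexdigest()


a = hashed_identifier({"name": "Robert Smith", "ssn": "123456789"})
b = hashed_identifier({"name": " robert smith ", "ssn": "123456789"})
```

Because the value is computed from the identifying information, anyone who knows the inputs and the procedure can recompute it, which is why such identifiers are described as not truly random.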

[0009] Figure 1 is a high-level data flow diagram 110 of an exemplary information network that can 
use the principles of this invention. In this example, a data provider 112 owns or controls a database 
114. The database 114 is organized as, for example, several data records. Each record includes at 
least one data field. Data for each person is stored as a single record, or is linked over several records. 
A field or part of a field in each record includes data that can be used to identify individuals, i.e. 
personal identifiable attributes. These attributes include, for example, 'name', 'address' and 'social 
security number'. Note that these are examples, and are not intended to be a complete list of all 
identifiable attributes. 

[0010] In addition to the identifying information, the database also includes other information 
about individuals. "Other information" may include, for example, medical information, financial 
data, buying information and website navigation data. Identifying information may also include 
non-identifying demographic data, such as a person's occupation, postal code or telephone area code. 
Depending on the type of other information in the database record, some of this demographic 
information may be classified as identifying information. For example, if the data records include 
highly sensitive medical information, the whole postal code may be considered identifying 
information, but a partial postal code, such as the first three digits of a five-digit postal code, may 
not be treated as identifying information. 
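
The partial postal code mentioned above can be produced with a short helper. The function name and the default of three digits are illustrative assumptions that follow the five-digit example in the text.

```python
def generalize_postal_code(postal_code, keep_digits=3):
    """Keep only the leading digits of a postal code, e.g. '19104' -> '191'."""
    digits = "".join(ch for ch in postal_code if ch.isdigit())
    return digits[:keep_digits]


partial = generalize_postal_code("19104")
```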

[0011] As the types of information that are considered identifying information vary with the type of 
data contained in the database, the data provider can decide which pieces of information in the 
individuals' records are considered identifying information and which pieces of information will be 
transferred for analysis by the data user. The data provider 112 makes a file 113 from the database. 
Each record in the file includes fields that have the identifiable attributes from each record in the 
database. The file 113 is sent to the trusted third party (TTP) 116. The TTP 116 creates unique 
identifiers linked with the identification attributes. These identifiers may be letters, numbers, a mix 
of alphanumeric characters, symbols, etc. If the data in the database is highly sensitive, it is possible 
to generate a unique identifier in a completely random and irreversible fashion, e.g. by taking the 
instantaneous value of the system clock register. If the data in the database is of low confidentiality, 
it is possible to generate a unique identifier from the identifying information by a reversible process. 
[0012] To generate the unique identifier, the TTP 116 first compares the identification data from a 
record in the file to the records in the internal database 115. The internal database 115 includes 
identifying information that has been processed previously by the TTP. Each record in this database 
also includes a source identifier that identifies the data provider that owns the data relating to 
the identification record, together with links to other records in the database that contain matching 
identifying information. If the TTP can find a match in its internal database and the previous data 
source is the provider of the current data, the TTP 116 uses the previously allocated unique identifier 
as the identifier for the new data. If the source for the previous data is not the provider of the current 
data, or if the TTP cannot find a match for the data in its database, a new unique identifier will be 
generated for the data set. Each of the unique identifiers is specific to the data provider. 
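
The reuse-or-create rule in this paragraph can be sketched in Python. Everything here is a simplified assumption: the class and method names are hypothetical, matching is reduced to an exact comparison of a normalised name and social security number, and `secrets.token_hex` stands in for the random generator (the text itself suggests the system clock register).

```python
import secrets


class TrustedThirdParty:
    """Toy stand-in for the TTP 116 and its internal database 115."""

    def __init__(self):
        # Maps (provider id, matching key) -> previously assigned identifier.
        self._assigned = {}

    @staticmethod
    def _match_key(identifying):
        # Assumption: exact match on normalised name and social security number.
        return (identifying["name"].strip().lower(), identifying["ssn"])

    def identifier_for(self, provider_id, identifying):
        key = (provider_id, self._match_key(identifying))
        if key not in self._assigned:
            # No earlier record from this provider matches: issue a new identifier.
            self._assigned[key] = secrets.token_hex(16)
        return self._assigned[key]


ttp = TrustedThirdParty()
person = {"name": "Robert Smith", "ssn": "123456789"}
first = ttp.identifier_for("provider_a", person)
again = ttp.identifier_for("provider_a", person)
elsewhere = ttp.identifier_for("provider_b", person)
```

Here `first` and `again` are the same identifier, while `elsewhere` is a separate one, so each identifier remains specific to one data provider.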
[0013] By allocating separate unique identifiers to represent the same person with different data 
providers, the TTP ensures that one data provider cannot identify data owned by another provider. 
Each data provider has identifying information for all the people within its database, and so if the 
same unique identifiers were used across multiple providers, one provider could use its own identifying 
information to identify the individuals behind depersonalized data owned by another data provider. 
In that case, the confidentiality of the data would be lost. 

[0014] When it extracts or generates a unique identifier, the TTP stores the identifier in the 
appropriate record field in the file 113. Once all of the records have been processed, the TTP 116 
returns the file 113 to the data provider 112. The data provider generates a new database 120 that 
includes the records from the original database. Identifiable attributes are deleted from the original 
database and are replaced with the unique identifiers. The database 120 includes random identifiers 
that are based on data that have been determined not to have personal identifying attributes, and the 
database 120 is sent to the data user 118. The data user will have obtained useful data that has been 
depersonalized, but it will not have the ability to identify individuals that match a particular data set. 
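
The construction of the database 120 can be sketched as below. The `depersonalize` helper, the `uid` field name, and the sample identifiers are hypothetical; the sketch assumes the TTP returns identifiers in the same order as the records in the file 113.

```python
def depersonalize(records, identifiers, identifying_fields=("name", "address", "ssn")):
    """Return copies of records with identifying fields replaced by a 'uid' field."""
    depersonalized = []
    for record, uid in zip(records, identifiers):
        row = {k: v for k, v in record.items() if k not in identifying_fields}
        row["uid"] = uid
        depersonalized.append(row)
    return depersonalized


database_114 = [
    {"name": "Robert Smith", "ssn": "123456789", "diagnosis": "asthma"},
    {"name": "Alice Jones", "ssn": "987654321", "diagnosis": "flu"},
]
returned_ids = ["a1f3", "9c2e"]  # hypothetical identifiers returned by the TTP
database_120 = depersonalize(database_114, returned_ids)
```

The original records are left untouched, so the data provider keeps its own link between identifiers and individuals while the data user receives only `database_120`.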
[0015] For highly sensitive data, it is desirable that the TTP 116 protects the relationship between the 
personal identifying information and the unique identifiers. The identifiers provided by the 
TTP 116 for this type of information are ideally completely random. Apart from the data provider 112 
and the TTP 116, no-one can relate the identifiers to particular individuals. The data provider 112 has 
the authority to grant permission, and only in situations where special permission has been granted 
will the data user be able to obtain the identifying information concerning the information 
in its possession. 
[Japanese-language text of paragraphs [0007] to [0069]: illegible in this copy. The legible fragments are listed below.] 

[0017] M. A. Jaro, "Probabilistic Linkage of Large Public Health Data Files" (Statistics in Medicine, vol. 14, John Wiley, pp. 491-498 (1995)); I. P. Fellegi et al., "A Theory of Record Linkage" (Journal of the American Statistical Association, vol. 64, No. 328, pp. 1183-1210 (1969)). 
[0022] EER: Element Error Rate. 
[0035] UNIX; Windows 98 or NT; Java. 
[0037] {BOB, ROB, ROBBY} = ROBERT. 
[0038] chance agreement. 
[0045]-[0046] UID: unique identifier. 
[0055] 123456789 and 123456798; ED: Edit Distance. 
[0068] false positive. 
APPARATUS AND METHOD FOR DEPERSONALIZING INFORMATION 
BACKGROUND OF THE INVENTION 

The present invention concerns the depersonalization of data associated with a particular 
individual and, in particular, a method for depersonalizing data from several sources without disclosing 
the personalized data. 

In modern society, information relating to specific individuals is obtained by numerous 
organizations. Healthcare, financial and commercial organizations such as hospitals, laboratories, banks, 
insurance companies and retailers own data that could be used for research and development, marketing, 
and other business functions. There is, however, a growing awareness of the necessity to maintain the 
privacy of the individuals connected with the data. In particular, information regarding an individual's 
health or financial status may be extremely sensitive. 

The analysis of this information often requires accessing data from multiple sources. For 
example, a study to determine the effectiveness of a particular medication may need to access records 
from a group of caregivers that prescribe the medication and from a corresponding group of pharmacies 
that dispense the medication. The data owned by each of the data providers contains sensitive 
information that they may be unable to share with the data user who will be analyzing the information. 
While the various data providers could remove any identifying information from their data and provide 
only the medical data to the data user, the data user would not be able to correlate the data from the 
various sources and, thus, would lose information that would be needed in the analysis. 

Therefore, a need has arisen for a method for obtaining personal data from multiple sources 
without the ability to identify the individual associated with the data but with the ability to associate 
individual data items from multiple sources as relating to a single individual. 
SUMMARY OF THE INVENTION 

The present invention relates to a computer implemented method and apparatus that allows an 
owner or provider of data that contains personal identifiers (data provider) to distribute that data to a 
data user in a depersonalized form, i.e., without revealing the identity of the individuals associated with 
the data. The data is otherwise unchanged. According to this method, a data provider separates the 
personal information from the other data to create two data sets. Only the personal identifying 
information is provided to a Trusted Third Party (TTP). The TTP generates an identifier that replaces 
any data in the database that can be used to identify an individual, such as name, address or social 
security number. The TTP may also collect and store the personal identifying information so that it can 
process identifying information that it acquires in the future to determine if the identifiers generated by 
the data provider or by the TTP refer to the same individual. The data provider associates the identifier 
provided by the TTP with the other data to create depersonalized data that may be sent to a data user for 
analysis. In this manner, different records from one or more data providers that refer to a single 
individual can be matched by the data user, and the data provider is assured that no personal identifying 
information is distributed that would link an individual to a particular data record. 
DETAIL DESCRIPTION OF THE DRAWINGS 
Figure 1 is a data flow diagram which is useful for describing how data is transferred among the 
various parties in the subject invention. 
Figure 2 is a data flow diagram which illustrates one exemplary data depersonalization method. 
Figure 3 is a data flow diagram that illustrates a second exemplary data depersonalization 
method. 
Figure 4 is a data flow diagram that illustrates a third exemplary data depersonalization method. 
Figure 5 is a data flow diagram that illustrates a fourth exemplary data depersonalization 
method. 
Figure 6 is a data flow diagram that shows how multiple data providers may interact with a 
trusted third party to provide data that may be correlated by one or more data users. 
Figure 7 is a block diagram that shows an exemplary computer configuration that may be used 
to implement the methods described in Figures 1 through 6. 
Figure 8 is a flow-chart diagram of an exemplary method of Figure 6. 
Figure 9 is a flow-chart diagram of an exemplary method of Figures 3, 4 or 5. 
DETAILED DESCRIPTION OF THE INVENTION 

Briefly, the present invention is a method and apparatus for processing sensitive information 
that identifies a person, so that it may be used for anonymous data analysis. In the embodiments of the 
invention described below, a data provider, who owns a database containing sensitive information, 
divides the information into two parts, identifying information and other information. Using the 
identifying information, the provider generates, or has generated for it, a unique identifier that is linked 
to the identification information in the data provider's database. The data owner then tags the other 
information with this unique identifier and provides the tagged data to the data user. In each of the 
embodiments described below, the unique identifier is generated by or registered with a Trusted Third 
Party (TTP) who is able to match the identifying information received from the data provider to other 
identifying information that may already be in the TTP's database. A TTP is an entity that is under a 
contractual agreement to protect the identifying information from being disclosed, while maintaining 
and processing the data as necessary. By matching the identifying information, the TTP can link 
identifiers that are associated with data from multiple providers. These links may be provided directly 
to the data users to allow the data users to correlate data from multiple sources. 
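
The links that the TTP may pass to data users can be pictured as groups of provider-specific identifiers that refer to one person. The sketch below is an assumption-laden illustration: `build_links`, the row layout, and the sample identifiers are hypothetical, and the `person_key` used for grouping is taken to stay inside the TTP.

```python
from collections import defaultdict


def build_links(ttp_rows):
    """Group provider-specific identifiers that the TTP matched to one person.

    ttp_rows: iterable of (person_key, provider_id, uid) tuples.
    Returns a list of sorted (provider_id, uid) groups with no identifying data.
    """
    groups = defaultdict(set)
    for person_key, provider_id, uid in ttp_rows:
        groups[person_key].add((provider_id, uid))
    # Only groups holding more than one identifier carry linking information.
    return [sorted(members) for members in groups.values() if len(members) > 1]


rows = [
    ("p1", "clinic", "u-17"),
    ("p1", "pharmacy", "u-88"),
    ("p2", "clinic", "u-42"),
]
links = build_links(rows)
```

A data user holding `links` can join the clinic record tagged `u-17` with the pharmacy record tagged `u-88` without learning who the person is.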
In the subject application, the word "depersonalizing" is used to describe the process by which 
the identifying information is removed from a user data record and replaced by a unique identifier. This 
term encompasses the terms "anonymizing" and "encoding" as they are used in the data processing arts. 
When data is anonymized, or encoded, all identifying information is removed from a record and a truly 
random identifier is assigned to represent the person. In addition, the term "depersonalizing" also 
encompasses a process by which an identifier that is not truly random replaces the personal 
identifying information in a data record. An identifier of this type may be, for example, a hash function 
value or other value produced from a predetermined subset of the identifying information. 

Figure 1 shows a high-level data flow diagram of an exemplary information network, 110, with 
which the principles of the invention may be used. In this example, a data 
provider 112 owns or controls a database, 114, which, for example, is organized as a plurality of data 
records, each record containing one or more data fields. The data for each person may be kept in a 
single record or it may be linked across multiple records. Fields or portions of the fields in each record 
contain data that can be used to identify the individual, namely, personal identifiable attributes. These 
attributes include, for example, "name," "address" and "social security number". This is an exemplary 
and not exhaustive listing of the identifiable attributes. 

In addition to the identifying information, the database contains other information about the 
individual. This "other information" may include, for example, medical information, financial data, 
purchase activity information or web-site navigation data. The identifying information may also include 
non-identifying demographic data, for example, the person's occupation, their postal code or their 
telephone area code. Depending on the type of "other information" in the database record, some of this 
demographic information may be classified as identifying information. For example, if the data record 
includes sensitive medical information then the entire postal code may be considered identifying 
information while a partial postal code, for example the first three digits of a five-digit zip code, would 
not be identifying information. 

Because the type of information that may be considered to be identifying information varies
with the type of data stored in the database, the data provider is best able to decide which information in
the person's record is considered to be identifying information and which information may be passed on
to a data user for analysis. The data provider 112 creates a file 113 from the database; each record of
the file contains the fields having the identifiable attributes from each record in the database. The file
113 is sent to a Trusted Third Party (TTP) 116. The TTP 116 creates a unique identifier to be associated
with the identifying attributes. This identifier can be alphabetic, numeric, alphanumeric, symbolic and
the like. If the data in the database is sensitive, the unique identifier may be generated in a totally
random fashion and in a manner that cannot be reversed, for example by taking the instantaneous value
of the system clock register. If the data in the database is less confidential, the unique identifier may be
generated from the identifying information by a reversible process.
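
The two kinds of identifier described above can be sketched as follows. This is an illustration only, not the encoding used by the TTP: the function names are hypothetical, the non-reversible identifier is drawn from the operating system's random source through Python's `secrets` module (swapped in here for the system clock register mentioned in the text), and the reversible identifier is a plain Base64 encoding chosen only to show that it can be decoded again.

```python
import base64
import secrets

FIELD_SEPARATOR = "\x1f"  # assumption: a character that never occurs in the fields


def random_identifier(num_bytes=16):
    """Non-reversible identifier: carries no information about the person."""
    return secrets.token_hex(num_bytes)


def reversible_identifier(identifying_fields):
    """Reversible identifier: an encoding of the identifying fields themselves."""
    joined = FIELD_SEPARATOR.join(identifying_fields)
    return base64.urlsafe_b64encode(joined.encode("utf-8")).decode("ascii")


def reverse_identifier(identifier):
    """Recover the identifying fields from a reversible identifier."""
    joined = base64.urlsafe_b64decode(identifier.encode("ascii")).decode("utf-8")
    return joined.split(FIELD_SEPARATOR)
```

A reversible identifier of this kind offers no protection by itself, which is consistent with the text reserving it for less confidential data.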

To generate the unique identifier, the TTP 116 first compares the identifying data from a record
in the file to records in an internal database 115 that contains identifying information which has
previously been processed by the TTP. Each record of this database also contains a source identifier
that identifies the data provider who owns the data associated with the identifying record, and links to
other records in the database that contain matching identifying information. If the TTP finds a match in
its internal database and if the source of the previous data is the supplier of the current data, then the
TTP 116 uses the previously assigned unique identifier as the identifier for the new data. If the source
of the previous data was not the supplier of the current data, or if the TTP does not find a match for the
data in its database, a new unique identifier is generated for the data set. Each unique identifier is
specific to the data provider.
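
A minimal sketch of this assignment rule is given below. The class and method names are hypothetical, and an exact comparison of the identifying fields stands in for the matching algorithm, which is described separately. The point illustrated is only that an identifier is reused when both the person and the data provider match, and that a new one is generated otherwise.

```python
import secrets


class TrustedThirdParty:
    """Toy TTP that keeps one identifier per (data provider, person) pair."""

    def __init__(self):
        # Maps (source identifier, identifying fields) to the assigned identifier.
        self._assigned = {}

    def assign(self, provider_id, identifying_fields):
        key = (provider_id, tuple(identifying_fields))
        if key not in self._assigned:
            # No earlier record from this provider for this person: new identifier.
            self._assigned[key] = secrets.token_hex(16)
        return self._assigned[key]
```

With this rule the same person receives different identifiers for different data providers, so one provider cannot recognize depersonalized data owned by another.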

By assigning a different unique identifier to represent the same person for respectively different
data providers, the TTP ensures that one data provider cannot identify any data owned by another
provider. Because each data provider has identifying information for all of the people in its database, if
the same unique identifier were used for multiple providers, one provider could link its identifying
information to depersonalized data that is owned by a different data supplier. This may result in a
breach of confidentiality for that data.

After retrieving or creating the unique identifier, the TTP stores it into a field of the appropriate
record in the file 113. When all of the records have been processed, the TTP 116 returns the file 113 to
the data provider 112. The data provider creates a new database 120 containing the records of the
original database from which the identifiable attributes are removed and replaced with the unique
identifier. The database 120, containing the random identifiers along with the data not determined to be
personal identifying attributes, is then sent to the data user 118. The data user now has useful data that
has been depersonalized so that the data user does not have the ability to identify an individual that
matches a particular set of data.
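
The provider-side steps can be sketched as two small functions: one that extracts the identifying fields into the file sent to the TTP, and one that builds the depersonalized records once the identifiers come back. The function names, the dictionary record layout and the `uid` field name are assumptions made for this illustration.

```python
def extract_identifying_file(records, identifying_fields):
    """Build the file sent to the TTP: identifying fields only, one row per record."""
    return [{f: record.get(f) for f in identifying_fields} for record in records]


def build_depersonalized_database(records, identifying_fields, uids):
    """Replace the identifying fields of each record with the TTP's identifier."""
    depersonalized = []
    for record, uid in zip(records, uids):
        row = {k: v for k, v in record.items() if k not in identifying_fields}
        row["uid"] = uid
        depersonalized.append(row)
    return depersonalized
```

Which fields count as identifying is an input here, reflecting the statement above that the data provider is best placed to make that decision.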

For sensitive data, it is desirable for the TTP 116 to protect the relationship between the
personal identifying information and the unique identifiers. For this type of information, the random
identifiers provided by the TTP 116 are desirably totally random; there should be no way for anyone
other than the data provider 112 or the TTP 116 to relate the identifier with the individual. Only in the
circumstance where the data provider 112 has authority to grant and grants specific permission should
the data user be able to obtain identifying information for any data in its possession.

In this exemplary embodiment, an individual may have multiple records within the database owned or
controlled by the data provider. In addition, as set forth above, the TTP 116 may have data on one
person from multiple data providers. In order to link newly received personal data to data already in the
database 115, the TTP 116 executes a matching algorithm on the data that it receives. In any scenario in
which a data user requires data from multiple providers, a TTP 116 is necessary.

Many matching algorithms may be used in the present invention. Exemplary matching
algorithms are disclosed in a paper by M. A. Jaro entitled "Probabilistic Linkage of Large Public Health
Data Files," Statistics in Medicine, vol. 14, John Wiley, pp. 491-498 (1995) and in an article by I. P.
Fellegi et al. entitled "A Theory for Record Linkage," Journal of the American Statistical Association, vol.
64, No. 328, pp. 1183-1210 (1969). The simplest matching algorithm is a deterministic match. By this
algorithm, individual data fields from the newly received personal data are compared to corresponding
fields in the data from the database 115. If all of these fields match, then the newly received data is
almost certainly for the person whose data is in the database. An exemplary set of fields that may be
used for a deterministic match are Last Name, First Name, Address and Social Security Number. Other
fields such as Telephone Number and Birth Date may also be used.

Deterministic matching techniques may not identify all matches or even a large percentage of
matches between two databases because of incomplete data or transcription errors. One method for
enhancing deterministic matching techniques is to employ probabilistic techniques to determine the
likelihood that two dissimilar fields match. Another technique is to normalize the data, for example by
expanding abbreviations and nicknames before performing the deterministic match or applying the
probabilistic techniques. Yet another method is to analyze dissimilar fields in otherwise matching
records by their edit distances to identify possible errors in transcription.



copending U.S. patent application No. 60/165,121 filed 15 November 1999 and is one of many possible
matching methods that may be used. The materials disclosed therein are incorporated by reference
herein to the extent they are material to the understanding and practice of this invention. The exemplary
matching technique comprises three steps: i) data standardization, ii) weight estimation



The following definitions and abbreviations are used for this exemplary embodiment:
μ-Probability: The probability that any random element pair will match by chance, as given by et

p-Probability: The reliability of the data element. If the Element Error Rate (EER) is > .99 then
p = 1 - EER; else p = .99 - EER.

Agreement: A condition such that a given element pair matches exactly and both elements are
known: $A_{e_n} = B_{e_n}$.

Agreement Weight: The weight assigned to an element pair when they agree during the record
matching process as shown in equation (2).

Cartesian Product: The set of ordered pairs $A \times B = \{(a, b) \mid a \in A \wedge b \in B\}$.

Disagreement: A condition such that a given element pair does not exactly match and both
elements are known: $A_{e_n} \neq B_{e_n}$.

Disagreement Weight: The weight assigned to an element pair when they disagree during the
record matching process as shown in equation (3).

Element Error Rate: The proportion of element pairs where at least one element is unknown,
e.g., null, as shown in equation (4).

$$\varepsilon = \frac{\text{number of element pairs in which at least one element is unknown}}{\text{number of element pairs}} \qquad (4)$$

Frequency Table: Summary of the number of times, or
a variable occur

Mean: Arithmetic average, as given in equation (5).

No Decision: A condition such that a
elements is unknown.

Random Number Assignment: In the exemplary embodiment of the invention, every record in
the data set is assigned a random number such that blocks of approximately 1500 are created:
$R = \operatorname{int}[(U \times P) + 1]$, where R is the resulting Random Number, U is the Upper Bound (defined below)
and P is a random function that returns a value between 0 and 1. In the exemplary embodiment of the
invention, P may be a pseudo random number generator.

Threshold: The threshold utilized in probabilistic matching is a binit odds ratio with a range of
$-\infty \le x \le \infty$.

Upper Bound: Number of strata such that the data set is divided into approximately equal blocks
of 1500 rows, as shown in equation (6).

$$U = \operatorname{int}\left(\frac{\text{Number of Records in Data Set}}{1500}\right) \qquad (6)$$

(P2000-324094A)


As regards the computer and machine language used in this process, just about any piece of hardware
capable of executing a fairly large number of calculations in short order will fill the bill. Any current
state-of-the-art PC or server could be used. As for the operating system, UNIX is preferred, but
Windows 98, Windows NT or the like could be used. The source code can be written in any
language, though Java is preferred.
Data Standardization

The first step of this process involves the standardization of data in an input file. This
standardization is required for increased precision and reliability. The input file can contain any number
of variables of which one or more are or may be unique to a particular data source such as an individual.
Examples of useful variables are: member identifier, driver's license number, social security number,
insurance company code number, name, gender, date of birth, street address, city, state, postal code and
citizenship. In addition, some identifiers can be further distilled down into their basic, or atomic,
components. For example, a name may be broken down into atomic components of first name, last
name and middle initial.

During the standardization process, all character data is preferably transformed to a single case,
and all abbreviations or nicknames are transformed to their longer forms. For example, all letters may
be transformed to uppercase. So, for instance, first names are standardized to uppercase and to their
longer form, e.g., BOB, ROB and ROBBY all become ROBERT. Common names for cities and streets may be
transformed to the postal code, e.g., in the U.S. to United States Postal Service standard. In the latter
instance this can be performed using industry standard CASS certified software.
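
The case folding and nickname expansion can be sketched as below. The nickname table contains only the example from the text, and the function names and the "FIRST MIDDLE LAST" input layout are assumptions made for the illustration; address standardization with CASS certified software is not attempted here.

```python
# Nickname table limited to the example given in the text (assumption: a real
# table would be far larger and supplied separately).
NICKNAMES = {"BOB": "ROBERT", "ROB": "ROBERT", "ROBBY": "ROBERT"}


def standardize_first_name(name):
    """Fold to uppercase, then expand a known nickname to its longer form."""
    upper = name.strip().upper()
    return NICKNAMES.get(upper, upper)


def split_name(full_name):
    """Break 'FIRST [MIDDLE] LAST' into atomic components (hypothetical layout)."""
    parts = full_name.split()
    first = standardize_first_name(parts[0])
    last = parts[-1].upper()
    middle_initial = parts[1][0].upper() if len(parts) > 2 else None
    return {"first_name": first, "middle_initial": middle_initial, "last_name": last}
```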
Weight Estimation 



The first step in the exemplary weight estimation process is to determine the number of strata
required such that the data set can be divided into approximately equal blocks of 1500 rows (Fig. 2,
201-219); see equation (6).

$$U = \operatorname{int}\left(\frac{\text{Number of Records in Data Set}}{1500}\right) \qquad (6)$$



The source file is then scanned and the records are assigned a random number between 1 and U.
A data matrix is created containing a Cartesian product of records with a random number of 1 assigned.
The resulting matrix is then scanned. Each element pair within each record pair is assessed and assigned
a value as shown in equation (7).

$$e_n = \begin{cases} 1 & \text{if } A_{e_n} = B_{e_n} \text{ (Agreement)} \\ 0 & \text{if } A_{e_n} = \text{Null and/or } B_{e_n} = \text{Null (No decision)} \\ -1 & \text{if } A_{e_n} \neq B_{e_n} \text{ (Disagreement)} \end{cases} \qquad (7)$$

where $A_{e_n}$ is the nth element from record A.

Once the matrix has been fully assessed, percentages for each $e_n$ are tabulated and stored. This
process may be repeated for a number (e.g., 15) of iterations.
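
The sampling and assessment steps can be sketched as follows. The function names and the dictionary record layout are assumptions. The guard that returns at least one stratum for data sets smaller than 1500 rows is also an assumption, since equation (6) alone would give zero. The Cartesian product is taken literally, so each record is also paired with itself.

```python
import itertools
import random

BLOCK_SIZE = 1500


def upper_bound(num_records, block_size=BLOCK_SIZE):
    """Equation (6): number of strata (at least one, by assumption)."""
    return max(1, num_records // block_size)


def assign_random_numbers(num_records, u, rng):
    """R = int[(U * P) + 1], with P drawn uniformly from [0, 1)."""
    return [int(u * rng.random()) + 1 for _ in range(num_records)]


def score_pair(a, b):
    """Equation (7): 1 for agreement, 0 for no decision, -1 for disagreement."""
    if a is None or b is None:
        return 0
    return 1 if a == b else -1


def assess_block(records, fields):
    """Tabulate, per field, the share of record pairs scoring 1, 0 and -1."""
    counts = {f: {1: 0, 0: 0, -1: 0} for f in fields}
    pairs = 0
    for rec_a, rec_b in itertools.product(records, repeat=2):
        pairs += 1
        for f in fields:
            counts[f][score_pair(rec_a.get(f), rec_b.get(f))] += 1
    if pairs == 0:
        return {}
    return {f: {k: v / pairs for k, v in c.items()} for f, c in counts.items()}
```

One iteration would call `assign_random_numbers`, keep the records whose number is 1, and pass them to `assess_block`; repeating this and averaging the shares gives the mean percentages used in the next step.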

Mean percentages of Agreements and No Decisions are calculated for each data element. The p
probability, or the reliability, for each data element is then calculated; see equation (8).

$$p = \begin{cases} 1 - \varepsilon & \text{if } \varepsilon > .99 \\ .99 - \varepsilon & \text{else} \end{cases} \qquad (8)$$

The μ probability, or the probability that element n for any given record pair will match by
chance, is calculated; see equation (9).

From the p and μ probabilities, the disagreement and agreement weights may be calculated
employing equations (10) and (11), respectively.

$$\text{Disagreement} = \log_2(\ldots) \qquad (10)$$
$$\text{Agreement} = \log_2(\ldots) \qquad (11)$$
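
The calculation of the p probability follows directly from equation (8). The arguments of the logarithms in equations (10) and (11) are not legible in this text, so the weight functions below assume the form commonly used in Fellegi-Sunter record linkage, with the reliability p in the role of the match probability and μ as the chance-agreement probability. They are a stand-in under that assumption, not a transcription of the patent's equations, and the function names are hypothetical.

```python
import math


def p_probability(element_error_rate):
    """Equation (8): reliability of a data element from its error rate."""
    if element_error_rate > 0.99:
        return 1.0 - element_error_rate
    return 0.99 - element_error_rate


def agreement_weight(p, mu):
    """Assumed form: log2(p / mu). Positive when agreement is informative."""
    return math.log2(p / mu)


def disagreement_weight(p, mu):
    """Assumed form: log2((1 - p) / (1 - mu)). Negative when p exceeds mu."""
    return math.log2((1.0 - p) / (1.0 - mu))
```

Under this assumed form, a reliable element that rarely agrees by chance (high p, low μ) earns a large positive agreement weight, so agreement on a social security number counts for more than agreement on gender.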



The final stage of this process is the action of uniquely identifying entities within the input data.

Each record from the input file is evaluated against the reference database 115 to determine if
the entity represented by the data has been previously identified using a combination of deterministic
and probabilistic matching techniques. If it is judged that the entity is already represented in the
reference set, the input record is assigned the unique identifier (UID) from the reference record that it
has matched against. If it is judged that the entity represented by data is not yet in the reference set, a
new UID is randomly generated and assigned. Random values may be generated using many different
algorithms. As set forth above, if the data is sensitive, it is desirable that the random identifier be truly
random, generated, for example, using the instantaneous value of the system clock register. For less
sensitive data, reversible methods may be used. It is desirable, however, for the identifier to be unique;
only one person should be associated with any one identifier. This random identifier may be numeric,
alphanumeric, or symbolic (e.g., a spatial pattern or hologram).

After the UID assignment occurs, the input record is evaluated, in its entirety, to determine if
the record is a unique representation of the entity not already contained in the reference table. If it is a
new record, then it is inserted into the reference database 115 for future use.
Deterministic Matching Technique

The exemplary deterministic matching technique employs simple Boolean logic and is applied
after the data has been standardized. Two records are judged to match if certain criteria are met, such as:

First Name Matches Exactly
Last Name Matches Exactly
Date of Birth Matches Exactly
Social Security Number OR Member Identifier Matches Exactly

If two records satisfy the criteria for deterministic matching, no probabilistic processing occurs.
However, if no deterministic match occurs, the input record is presented for a probabilistic match.
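
The exemplary criteria translate into a short Boolean test. The field names are hypothetical, and treating a missing value as a non-match is an assumption made for this sketch.

```python
def deterministic_match(record_a, record_b):
    """True when the exemplary criteria are all met on standardized records."""

    def same(field):
        # A missing (None) value never counts as a match (assumption).
        value = record_a.get(field)
        return value is not None and value == record_b.get(field)

    return (
        same("first_name")
        and same("last_name")
        and same("date_of_birth")
        and (same("ssn") or same("member_id"))
    )
```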
Probabilistic Matching Technique

The first step in the probabilistic matching process is to build a set of candidate records from the
reference table based on characteristics of specific elements of the input record. This process is referred
to as blocking; the set of candidate records is referred to as the blocking table. Not all data sets use
the same characteristics; the elements used in this process are determined through data analysis. It is
suggested, however, that the blocking variables include those elements that are somewhat unique to an
individual, e.g., social security number, or a combination of date of birth and last name.

Upon completion of the construction of the blocking table, each element for each candidate record is
compared against its corresponding element from the input record. See equation (12) for the scoring.

$$\text{score} = \begin{cases} \text{Agreement Weight} & \text{if } A_{e_n} = B_{e_n} \\ 0 & \text{if } A_{e_n} = \text{Null and/or } B_{e_n} = \text{Null} \\ \text{Disagreement Weight} & \text{if } A_{e_n} \neq B_{e_n} \end{cases} \qquad (12)$$

where $A_{e_n}$ is the nth element from record A.

A composite weight is then calculated for all candidate records; see equation (13).




(13) 



The candidate record with the highest composite weight is then evaluated against a predefined
threshold. If the weight meets or exceeds the threshold, the candidate record is judged to match the input
record. If the weight does not meet the threshold, it is assumed that the input record represents an
entity not yet included in the reference set.
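
The blocking, scoring and threshold steps can be sketched end to end as below. Equation (13) is not legible in this text, so the composite weight is assumed here to be the sum of the element scores, which is the usual choice in probabilistic record linkage. The function names, the record layout and the rule that a candidate enters the blocking table when any one blocking field agrees are likewise assumptions.

```python
def element_score(a, b, agree_w, disagree_w):
    """Equation (12): weight for one element pair; 0 when either is unknown."""
    if a is None or b is None:
        return 0.0
    return agree_w if a == b else disagree_w


def composite_weight(input_rec, candidate, weights):
    """Assumed form of equation (13): sum of the element scores."""
    return sum(
        element_score(input_rec.get(f), candidate.get(f), aw, dw)
        for f, (aw, dw) in weights.items()
    )


def build_blocking_table(input_rec, reference, blocking_fields):
    """Candidates that agree with the input record on any blocking field."""
    return [
        rec for rec in reference
        if any(
            input_rec.get(f) is not None and input_rec.get(f) == rec.get(f)
            for f in blocking_fields
        )
    ]


def probabilistic_match(input_rec, reference, weights, blocking_fields, threshold):
    """Return the best candidate if it meets the threshold, else None."""
    candidates = build_blocking_table(input_rec, reference, blocking_fields)
    if not candidates:
        return None
    best = max(candidates, key=lambda rec: composite_weight(input_rec, rec, weights))
    if composite_weight(input_rec, best, weights) >= threshold:
        return best
    return None
```

A result of `None` corresponds to the case in which the input record is taken to represent an entity not yet in the reference set, so a new identifier would be generated.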

The exemplary matching technique does not attempt to determine whether two fields that
disagree represent the same data. If, for example, because of a transcription error, a social security
number of 123 45 6789 were recorded as 123 45 6798, the algorithm set forth above would indicate
disagreement. One alternative enhancement to the algorithm set forth above may be to employ some
measure of similarity such as Edit Distance between similar fields. For example, the social security
numbers described above have an edit distance of one because a single transposition of the last two digits
would produce the correct result. This measure of similarity may be employed, for example, as a part of
the probabilistic process or as a post processing step to confirm that the result of the probabilistic
process is correct.
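
One edit distance under which the two social security numbers in the example are one edit apart is the optimal string alignment variant of the Damerau-Levenshtein distance, which counts a swap of two adjacent characters as a single edit. The text does not say which edit distance is intended, so the choice of this variant and the function name are assumptions.

```python
def osa_distance(s, t):
    """Optimal string alignment distance: insert, delete, substitute, adjacent swap."""
    rows, cols = len(s) + 1, len(t) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of s[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of t[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            if i > 1 and j > 1 and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent swap
    return d[-1][-1]
```

A post-processing step could, for example, treat a disagreeing field as a probable transcription error when this distance is 1, although any such cutoff would need to be chosen from the data.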



Figures 2, 3, 4 and 5 show alternative embodiments for employing a TTP 116 in the anonymous
transfer of sensitive information from a data provider 112 to a data user 118. Although each of the
embodiments includes a single data provider, it is contemplated that, except for Figure 2, all

In Figure 2, a data supplier 112 processes input information in the
database 111 to separate the personal data 113 from the other data in the database. The personal data is
sent to the TTP 116 for processing, as described above. The TTP 116 returns the personal data with
each record now including a unique identifier. The data supplier 112 then matches the unique identifier
to the data in the input database 111 and separates the other information and the associated unique
identifiers into a depersonalized database 120. This depersonalized database is then sent to the data user
118 for analysis.

In the exemplary embodiment shown in Figure 1, there is no direct communication between the
TTP 116 and the data user 118. This embodiment may be used where a single data provider includes
multiple data sources and needs to match the data from the various data sources. One example of this is
a hospital environment in which billing records, patient treatment records, pharmacy records, radiology
records and therapy records may be kept separately, perhaps by separate contractors. The hospital may
want to match these records internally for its own use and may want to provide the data to an external
data user. In this embodiment, the TTP 116 matches the records from the various data sources and
provides a single unique identifier for each person among all of the sources.
The exemplary embodiment shown in Figure 3 differs from that shown in Figure 2 in that the
TTP 116 does not communicate the unique identifier to the data provider. In this embodiment, the
provider 112 processes its input database to generate two databases. One database, 113, has only
identifying information and the other database has only the other information. The data provider
assigns common identifiers to corresponding records in the two databases. These identifiers may be as
simple as a record number or as complex as a random identifier for a particular individual. In the first
instance, the data provider makes no attempt to link multiple records for the same person. In the second
instance, the data provider has already linked the records and has placed the unique identifier for the
person into both the records of the database 113 and the corresponding records of the database 120.
Where the data provider has assigned unique identifiers, the identifiers may be random, pseudo random
or reversible. It is noted, however, that reversible unique identifiers may only be used in situations
where at least some personal information may be disclosed.

The database 113 is provided to the TTP 116 where it is processed, as described above, to
match records having the same identifying information to each other and to records in the internal
database (not shown) of the TTP 116.

At the same time that the identifying data is sent to the TTP, the database 120 containing the
other data is sent to the data user 118. After receiving the database 120, the data user waits to receive
correlating data 310 from the TTP 116. This correlating data matches the record identifiers or unique
identifiers from the data provider to unique identifiers generated by the TTP. The data user adds the
unique identifiers generated by the TTP 116 to the appropriate records of the database 120 and
processes the other information using the TTP unique identifiers.

When the system shown in Figure 3 is used with multiple data providers, the correlating data
310 provided by the TTP 116 may also include a table indicating correspondence among the unique
identifiers or record numbers provided by the multiple data providers. Using this information, the data
user 118 may associate data from the multiple providers before performing the data analysis.
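
The data user's side of this arrangement can be sketched as a lookup followed by a grouping step. The layout of the correlating data, a mapping from (provider, provider-assigned identifier) to the TTP identifier, and all field and function names are assumptions made for the illustration.

```python
from collections import defaultdict


def add_ttp_identifiers(records, provider_id, correlating_data):
    """Add the TTP-generated identifier to each depersonalized record."""
    linked = []
    for record in records:
        row = dict(record)
        row["ttp_uid"] = correlating_data[(provider_id, record["provider_uid"])]
        linked.append(row)
    return linked


def group_by_person(linked_records):
    """Associate records from all providers that share a TTP identifier."""
    groups = defaultdict(list)
    for row in linked_records:
        groups[row["ttp_uid"]].append(row)
    return dict(groups)
```

The data user never sees identifying fields in this flow; it only learns which depersonalized records belong together.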

The system shown in Figure 4 is similar to that described above with reference to Figure 2 except that,
in the system of Figure 4, there is communication between the TTP 116 and the data user 118. In
Figure 4, the data supplier sends the identifying information to the TTP 116, which matches the data, adds
unique identifiers and sends the identifying information with the unique identifiers back to the data
supplier 112. The data supplier then copies the unique identifiers from the identifying information
records to the associated other information records and provides the other information records to the
data user 118. The data user 118 then receives correlating data (410) directly from the TTP 116. In this
instance, the correlating information includes unique identifiers from other data suppliers that
correspond to the unique identifiers in the depersonalized data 120 that is provided by the data supplier
112.

In the system shown in Figure 4, this correlating data 410 may be provided by the TTP 116 to
the data user 118 at the request of the data provider 112 or it may be requested by the data user 118.
When the data is requested by the data provider, the TTP provides correlating information for all of the
data suppliers in its database. When the data user asks for data, however, it requests information from
only those data providers from which it receives data.

Figure 5 shows a system that is similar to the system shown in Figure 3 except that, rather than
send all correlating data to the data user, the TTP 116 sends correlating data to the data user 118 only in
response to a specific request. As with the system shown in Figure 4, that request may be for only those
data providers who supply data to the data user 118.

In any of the systems shown in Figures 1 through 5, it may be necessary for the data user to
identify the person whose data is being evaluated. If, for example, the data user 118 is processing
medical data and identifies a life-threatening condition, the data user may need to notify the individual.
In this instance, the data user may ask the data supplier for the identifying information. In situations
where the unique identifiers being used by the data user do not match the identifiers held by the data
provider, the data provider 112 may then authorize the TTP 116 to divulge the information to the data
user 118.

In this embodiment, the Trusted Third Party 116 provides each data provider 112a, 112b and 112c with
software and/or hardware that performs the depersonalizing process and a supporting database 115a,
115b and 115c that holds the identified depersonalized data. Each database 115a, 115b and 115c
contains individual identifiable attributes and individual identifiers for the respective data provider 112a,
112b and 112c obtained from a central database 115 owned or controlled by the TTP 116. The central
database 115 is populated with information obtained from authorized sources of such information
during past processing. For each record the data provider wishes to supply to a data user 118, the data
provider extracts the identifying fields for the record and inputs them into the depersonalizing process.
The depersonalizing process assigns the random identifier by matching the information held by the data
user with information previously stored in the database provided by the Trusted Third Party. If no
matching data is found in the respective database 115a, 115b and 115c, a unique and possibly random
identifier is assigned and provided as output from the process. If a match with previously
depersonalized data is encountered, the unique identifier assigned initially is provided as output from
the process. The data providers 112a, 112b and 112c substitute the unique identifiers for the individual
identifiable attributes in the record to create respective depersonalized records. The data suppliers then
send the depersonalized records to the data user 118.
In order to enable the linking of multiple sources of depersonalized data, each data provider
112a, 112b and 112c supplies, to the TTP 116, a file containing the identifying data and the unique
identifiers assigned by the data provider's depersonalizing process 116a, 116b and 116c. The TTP
correlates these files to identify matches among the identifying information records provided by the
respective data providers and stores the unique identifiers, with indications of any correlation, within the
central database. When authorized by the data provider, the TTP may supply information to the data
user showing the random identifiers from any of the data providers that relate to the same individual,
thus allowing the data user to create a linked depersonalized database 120.

In some instances, a data provider 112a will not supply the identifying data to the TTP 116. In
this instance, the TTP 116 will maintain a central database that is pre-populated with data from public
sources, such as telephone directories, and will supply the matching algorithms to the data provider.
The TTP 116 will receive only those files from a data supplier that have been previously matched with
the TTP 116 database. It is apparent that correlation of data within certain groups of individuals who do
not exist in the public databases, such as children, may be excluded from the data user. However, the
process favors false negative correlation over false positive.

A practitioner skilled in the art would recognize the many permutations of the basic concept of
the present invention, that is, the use of a trusted third party with a data provider and a data user to
depersonalize data as the data passes from provider to user. The embodiments described above are
exemplary in nature, and do not constitute an exhaustive listing of the various ways this invention may
be implemented.

Figure 7 is a block diagram of an exemplary physical implementation of any of the information
networks shown in Figures 1 through 6. The exemplary system is linked by a local area or wide area
network 716 which may also be connected to a global information network, such as the Internet, by a
direct communications interface 718 and by removable media 722. The exemplary system shown in
Figure 7 includes six processing systems, 710, 730, 740, 760, 770 and 780. Each of these systems may
include any of the communication interfaces shown for processing system 710. Each of the systems 710,
730, 740, 760, 770 and 780 has an associated database 712, 732, 742, 762, 772 and 782. The databases
maintained by the data provider, data user and TTP may reside on any commercially available host
computer, as currently known in the art.

The exemplary processing system 710 includes a host computer 714 and a network interface
716 by which the host computer 714 may communicate with other data processing systems via a local
area network, a wide area network or a global information network. As shown in Figure 7, the host
computer 714 communicates with the processing systems 740 and 730 via a local area network (LAN)
717. Computer 714 also uses the LAN 717 to communicate with a global information network server
750 and, through the server 750 and global information network 752, to remote users 760 and 780.

In addition to the network interface, the host computer 714 of the data processing system 710 includes a
communications interface 718, for example, a modem, through which the processing system 710 may
communicate with the remote user 770. The processing system 710 also includes an input/output (I/O)
processor 720 which is coupled to a removable media device 722, for example a diskette drive, through
which the host computer can communicate with any other computer system that does not have a direct
or indirect data communication path with the host computer 714.

Each host computer may contain one or more processors (not shown), memory (not shown),
input and output devices (not shown), and access to mass storage (not shown). Each processing system
may be a single system or a network of computers, as currently known in the art. The data providers,
TTP and data users may exchange data over a computer network such as LAN 717 or by physically
transferring data on removable media 722 from location to location. The system may also be
implemented across a global information network such as the Internet. The host computer and the
global information network may also communicate with a plurality of remote users.

The term "database" may be broadly interpreted to mean any database using records and fields,
or their equivalent. The method is not limited by the high-level language used to code the data or the
language used to code the programs which implement the required data processing.

It is contemplated that the subject invention may be practiced in computer software executed by the data
provider(s) 112, trusted third party 116 and data user 118. This computer software may be implemented
on a carrier, such as a diskette, CD-ROM, DVD-ROM or radio frequency or audio frequency carrier.

Figures 8 and 9 are flow-chart diagrams which illustrate exemplary embodiments of the
invention. Figure 8 illustrates a process such as that shown in Figure 6 and Figure 9 shows a process
such as that shown in Figures 3, 4 or 5.

In Figure 8, at step 810, the TTP 116 provides the encoding process and encoding database to
two retailers, retailer 112a and retailer 112b. The retailers implement the process and database within
their company. The databases 115a and 115b provided by the TTP 116 in this exemplary embodiment
of the invention are pre-populated with information supplied from the TTP central database 115. The
information provided does not include any unique identifiers.

At step 812, each of the retailers 112a and 112b extracts the individual demographic attributes and
individual identifiers from each data record it wishes to send to the data user 118, in this example, a
marketing agency. For each record, the information is processed through the TTP's supplied encoding
process. The encoding process, at step 814, assigns a unique identifier to each record. Next, at step 816,
the retailers 112a and 112b create the depersonalized data by replacing the individual demographic
attributes and individual identifiers with the single unique identifier provided by the encoding process
and send the depersonalized data to the marketing agency 118.

Next, at step 818, the retailers 112a and 112b send, to the TTP 116, the unique identifiers
assigned for each record where they encountered a match during the encoding process execution.
The TTP 116, at step 820, stores the unique identifier assignment information provided by the retailers
112a and 112b in its central database 115. Also at step 820, the TTP 116 sends the unique identifiers
for the retailers 112a and 112b, which link to the same individual, as the correlating information to the
marketing agency 118.

At step 822, the marketing agency links the data using the correlating information and performs
its marketing study. This study is performed without the ability to identify any individual person.

As illustrated by the arrow from block 822 to block 812, the process is iterative. Periodically, the TTP
116 sends updates to the encoding process and database to the retailers 112a and 112b. These updates
result from updates / additions to the encoding process central database obtained by TTP 116. After
processing these updates, the retailers 112a and 112b send back to the TTP 116 all unique identifiers
that were previously assigned by the retailers to the newly supplied information.

It is noted that in this embodiment of the invention, the retailers 112a and 112b never provided
any identifiable retail information. The retail data provided by the retailers to the marketing agency had
no individual identifiable attributes. Thus, the marketing agency 118 never knew the identity of the
actual individuals. Nonetheless, the marketing agency 118 was able to use the power of the retailers'
information to enhance marketing study capability.

In the exemplary embodiment of the invention shown in Figure 9, a manufacturer 118 wishes to
use the healthcare information of three local healthcare providers to identify the health habits of a
specific disease state. Three data providers 112, ProviderA, ProviderB and ProviderC, have information
which identifies the individual (for example, member number, social security number, name, etc.). The
manufacturer 118, ProviderA, ProviderB and ProviderC contractually authorize a Trusted Third Party
(TTP) 116 to encode the healthcare data using the healthcare data encoding process shown in Figure 9.

At step 910 of this process, ProviderA, ProviderB and ProviderC each extracts the individual
identifiable information from their internal databases 111 of healthcare records into a file 113. At step
912, ProviderA, ProviderB and ProviderC send the files to the TTP 116.



At step 914, the TTP 116 identifies each individual using its matching process and assigns an
Encoding Key to each record. At step 916, the TTP 116 sends the files with the corresponding
Encoding Keys back to ProviderA, ProviderB and ProviderC. Next, at step 918, ProviderA, ProviderB
and ProviderC replace the individual attributes for each record they wish to send to the manufacturer
118 with the encoding key received from the TTP 116. Also at step 918, ProviderA, ProviderB and
ProviderC send the encoded healthcare information files to the manufacturer 118. At step 920, the
manufacturer receives the encoded healthcare information files and obtains the correlating data from the
TTP 116. Finally, at step 922, the manufacturer 118 links the data from ProviderA, ProviderB and
ProviderC and completes its study. It is noted that this study is completed without the manufacturer
being able to identify any person.
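
Purely as an illustrative sketch, the exchange of Figure 9 between a provider and the TTP could be modeled in Python as below. Exact matching on the identifying fields, random hexadecimal encoding keys, and a distinct key per provider for the same person are all assumptions of the sketch; the actual matching process of the TTP is not detailed here.

```python
import secrets


class TrustedThirdParty:
    """Toy TTP that matches individuals on exact identifying fields (assumed)."""

    def __init__(self):
        self._keys = {}  # (provider, identity) -> encoding key

    def encode(self, provider, identifying_records):
        """Step 914: assign one encoding key per identifying record, in order."""
        keys = []
        for identity in identifying_records:
            slot = (provider, identity)
            if slot not in self._keys:
                # Random key: reveals nothing about the identifying fields.
                self._keys[slot] = secrets.token_hex(8)
            keys.append(self._keys[slot])
        return keys

    def correlating_data(self):
        """Step 920: link keys that refer to one person across providers."""
        groups = {}
        for (provider, identity), key in self._keys.items():
            groups.setdefault(identity, {})[provider] = key
        return [g for g in groups.values() if len(g) > 1]


def encode_provider_file(ttp, provider, records, id_fields):
    """Provider side: extract identities, obtain keys, strip identifying fields."""
    identities = [tuple(r[f] for f in id_fields) for r in records]  # step 910
    keys = ttp.encode(provider, identities)                         # steps 912-916
    return [                                                        # step 918
        {"key": k, **{f: v for f, v in r.items() if f not in id_fields}}
        for k, r in zip(keys, records)
    ]
```

Because each provider receives different keys for the same person in this sketch, the manufacturer can link records across providers only by using the correlating data obtained from the TTP.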

While the invention has been described in terms of a number of exemplary embodiments, it is 
contemplated that it may be practiced as described above with variations that are within the scope of the 
appended claims. 



(6 2))00-324094 (P2000-324094A) 



What is Claimed:

other data fields, in an information network comprising a data provider, a data user and a trusted third
party, wherein the identifying information in each record identifies a person, said method comprising
the steps of:

a) separating the identifying information fields from the other data fields for each data record to
generate identifying records;

b) transferring a copy of the identifying records to the trusted third party;

c) associating, by the trusted third party, each of the identifying records with a unique identifier,
wherein a respectively different unique identifier is assigned to each person identified by one or
more of the identifying records; and

d) transferring, by the trusted third party, the unique identifiers to the data provider;

e) associating, by the data provider, the other data fields with the respective unique identifiers to
form depersonalized data; and

f) transferring, by each of the data providers, the depersonalized data to the data user.

2. A method according to claim 1 wherein the step of associating the identifying records
by the trusted third party includes the step of generating a random identifier that cannot be used to
recover any of the identifying information fields as the unique identifier.

3. A method of distributing data records, which include identifying information fields and
other data fields, in an information network comprising a plurality of data providers, a data user and a
trusted third party, wherein the identifying information in each data record identifies a person, said
method comprising the steps of:

a) separating, by each of the data providers, the identifying information fields from the other data
fields for each data record to generate identifying records;

b) transferring, by each of the data providers, a copy of the identifying records to the trusted third
party;

c) associating, by the trusted third party, each of the identifying records, with a unique identifier,
wherein a respectively different unique identifier is assigned to each individual person identified
by one or more of the identifying records; and

d) transferring, by the trusted third party, the unique identifiers to the respective data providers
from which the identifying records used to generate the unique identifiers were received;

e) associating, by each of the data providers, the other data fields with the respective unique
identifiers to form depersonalized data; and

f) transferring, by each of the data providers, the depersonalized data to the data user.

4. A method according to any one of claims 1-3 wherein the step of associating, by the
trusted third party, each of the identifying records, with a unique identifier, includes the step of
generating a random identifier that cannot be used to recover any of the identifying information fields as
the unique identifier, wherein when the identifying information fields provided by more than one of the
plurality of data providers corresponds to one person, respectively different unique identifiers are
generated for each of the more than one information providers.

5. A method according to any one of claims 1-4 wherein the step of associating, by the
trusted third party, each of the identifying records, with a unique identifier further includes the steps of:

a) recording, by the trusted third party, a correlation of each person for whom multiple unique
identifiers are assigned to form correlating information; and

b) transferring, by the trusted third party, the correlating information to the data user.

6. A method according to any one of claims 1-5 wherein the step of transferring, by the
trusted third party, the correlating information to the data user, includes the steps of:

a) receiving, from the data user, a request for correlating information for specific ones of the
plurality of data providers; and

b) transferring the correlating information for only the specific ones of the plurality of data
providers.

7. A method of distributing a plurality of data records, which include identifying
information fields and other data fields, in an information network comprising a plurality of data
providers, a data user and a trusted third party, wherein the identifying information in each data record
identifies a person, said method comprising the steps of:

a) generating, by each of the data providers, a plurality of first unique identifiers from the
identifying information fields of the plurality of data records;

b) transferring, by each of the data providers, a copy of the identifying information fields from
each of the plurality data records and a respective copy of each of the plurality of unique
identifiers, as a respective plurality of identifying records, to the trusted third party;

c) transferring, by each of the data providers, a copy of the other data fields from each of the
plurality data records and a respective copy of each of the plurality of first unique identifiers, as
a respective plurality of data records, to the data user;

d) associating, by the trusted third party, each of the identifying records, with a second unique
identifier, wherein a respectively different second unique identifier is assigned to each
individual person identified by one or more of the identifying records; and

e) transferring, by the trusted third party, the first unique identifiers and the second unique
identifiers to the data user;

f) associating, by the data user, the other data records provided by the data provider with the
unique identifiers provided by the trusted third party.

8. A method of processing and distributing a plurality of data records, wherein each of the
plurality of data records contains information used to identify a person, by a trusted third party, said
method comprising the steps of:

a) receiving, from a plurality of data providers, a copy of the plurality of identifying records;

b) associating each of the identifying records, with a unique identifier, wherein a respectively
different unique identifier is assigned to each individual person identified by one or more of the
identifying records;

c) matching records associated with a particular person among the identifying records provided by
the plurality of data providers, to generate the second unique identifier which is the same for all
identifying records provided by the plurality of data providers; and

d) transferring the unique identifiers to the respective data providers from which the identifying
records used to generate the unique identifiers were received.

9. A carrier containing a set of instructions for causing a general purpose computer
network comprising a data provider, a data user and a trusted third party, said network accessing a
plurality of data records which include identifying information fields and other data fields, wherein the
identifying information in each record identifies a person, to perform the following steps:

a) separating the identifying information fields from the other data fields for each data record to
generate identifying records;

b) transferring a copy of the identifying records to the trusted third party;

c) associating, by the trusted third party, each of the identifying records with a unique identifier,
wherein a respectively different unique identifier is assigned to each person identified by one or
more of the identifying records; and

d) transferring, by the trusted third party, the unique identifiers to the data provider;

e) associating, by the data provider, the other data fields with the respective unique identifiers to
form depersonalized data; and

f) transferring, by each of the data providers, the depersonalized data to the data user.

10. A carrier according to claim 9 wherein the step of associating the identifying records by
the trusted third party includes the step of generating a random identifier that cannot be used to recover
any of the identifying information fields as the unique identifier.

11. A carrier containing a set of instructions for causing a network of general purpose
computers comprising a plurality of data providers, a data user and a trusted third party,
accessing a plurality of data records which include identifying information and other fields, wherein the
identifying information in each data record identifies a person, said instructions comprising the steps of:

a) separating, by each of the data providers, the identifying information fields from the other data
fields for each data record to generate identifying records;

b) transferring, by each of the data providers, a copy of the identifying records to the trusted third
party;

c) associating, by the trusted third party, each of the identifying records, with a unique identifier,
wherein a respectively different unique identifier is assigned to each individual person identified
by one or more of the identifying records; and

d) transferring, by the trusted third party, the unique identifiers to the respective data providers
from which the identifying records used to generate the unique identifiers were received;

e) associating, by each of the data providers, the other data fields with the respective unique
identifiers to form depersonalized data; and

f) transferring, by each of the data providers, the depersonalized data to the data user.

12. A carrier according to claim 11 wherein the step of associating, by the trusted third
party, each of the identifying records, with a unique identifier, includes the step of generating a random
identifier that cannot be used to recover any of the identifying information fields as the unique identifier,
wherein when the identifying information fields provided by more than one of the plurality of data
providers corresponds to one person, respectively different unique identifiers are generated for each of
the more than one information providers.

13. A carrier containing a set of instructions for causing a network of general purpose
computers, said network comprising a plurality of data providers, a data user and a trusted third party,
said network accessing a plurality of data records which include identifying information fields and other
data fields, wherein the identifying information in each data record identifies a person, to perform a
method comprising the steps of:

a) generating, by each of the data providers, a plurality of first unique identifiers from the
identifying information fields of the plurality of data records;

b) transferring, by each of the data providers, a copy of the identifying information fields from
each of the plurality data records and a respective copy of each of the plurality of unique
identifiers, as a respective plurality of identifying records, to the trusted third party;

c) transferring, by each of the data providers, a copy of the other data fields from each of the
plurality data records and a respective copy of each of the plurality of first unique identifiers, as
a respective plurality of data records, to the data user;

d) associating, by the trusted third party, each of the identifying records, with a second unique
identifier, wherein a respectively different second unique identifier is assigned to each
individual person identified by one or more of the identifying records; and

e) transferring, by the trusted third party, the first unique identifiers and the second unique
identifiers to the data user;

f) associating, by the data user, the other data records provided by the data provider with the
unique identifiers provided by the trusted third party.

14. The carrier of claim 13 further comprising instructions to perform the steps of matching
records associated with a particular person among the identifying records provided by the plurality of
data providers, to generate the second unique identifier which is the same for all identifying records
provided by the plurality of data providers, wherein the matching is performed by the trusted third party.

15. A carrier containing a set of instructions for causing a general purpose computer
accessing a plurality of data records, wherein each of the plurality of data records contains information
used to identify a person, by a trusted third party, to perform the steps of:

a) receiving a plurality of identifying records from a first data provider;

b) associating each of the plurality of identifying records with a unique identifier, wherein a
respectively different unique identifier is assigned to each person identified by one or more of
the plurality of identifying records; and

c) transferring the unique identifiers to the data provider.

16. A carrier according to claim 15 wherein the step of associating the identifying records
includes the step of generating a random identifier that cannot be used to recover any of a plurality of
identifying information fields as the unique identifier.

17. A carrier containing a set of instructions for causing a general purpose computer
accessing a plurality of data records, wherein each of the plurality of data records contains information
used to identify a person, by a trusted third party, to perform the steps of:

a) receiving, from a plurality of data providers, a copy of the plurality of identifying records;

b) associating each of the identifying records, with a unique identifier, wherein a respectively
different unique identifier is assigned to each individual person identified by one or more of
the identifying records;

c) matching records associated with a particular person among the identifying records
provided by the plurality of data providers, to generate the second unique identifier which
is the same for all identifying records provided by the plurality of data providers; and

d) transferring the unique identifiers to the respective data providers from which the
identifying records used to generate the unique identifiers were received.



18. A carrier according to claim 17 wherein the step of associating, by the trusted third
party, each of the identifying records, with a unique identifier, includes the step of generating a random
identifier that cannot be used to recover any of the identifying information fields as the unique identifier,
wherein when the identifying information fields provided by more than one of the plurality of data
providers corresponds to one person, respectively different unique identifiers are generated for each of
the more than one information providers.

A computer implemented method allows an owner or provider of data that contains personal
identifiers (data provider) to distribute that data to a data user in a depersonalized form, i.e., without
revealing the identity of the individuals associated with the data. The data provider first separates the
personal information from the other data to create two data sets. The personal identifying information is
then provided to a Trusted Third Party (TTP). The TTP associates a unique identifier with the
identifying information. This unique identifier replaces any data in the database that can be used to
identify an individual, such as name, address or social security number. The TTP may also collect and
store the personal identifying information so that it can process identifying information that it acquires
in the future to determine if the unique identifiers generated by the data provider or by the TTP refer to
the same individual. The data provider associates its own unique identifier or the identifier provided by
the TTP with the other data to create depersonalized data that may be sent to a data user for analysis. In
this manner, different records from one or more data providers that refer to a single individual can be
matched by the data user, and the data provider is assured that no personal identifying information is
distributed that would link an individual to a particular data record. The TTP transmits information that
correlates unique identifiers from multiple data providers to a data user. Each data provider transmits
the depersonalized data, including the unique identifiers, to the data user. The data user correlates the


